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Abstract 

RSVDPACK is a library of functions for computing low rank approximations of matrices. The 
library includes functions for computing standard (partial) factorizations such as the Singular Value 
Decomposition (SVD), and also so called “structure preserving” factorizations such as the Inter¬ 
polative Decomposition (ID) and the CUR decomposition. The ID and CUR factorizations pick 
subsets of the rows/columns of a matrix to use as bases for its row/column space. Such factor¬ 
izations preserve properties of the matrix such as sparsity or non-negativity, are helpful in data 
interpretation, and require in certain contexts less memory than a partial SVD. The package imple¬ 
ments highly efficient computational algorithms based on randomized sampling, as described and 
analyzed in N. Halko, P. G. Martinsson, J. Tropp, “Finding structure with randomness: Probabilis¬ 
tic algorithms for constructing approximate matrix decompositions,” SIAM Review, 53(2), 2011, 
and subsequent papers. This manuscript presents some modifications to the basic algorithms that 
improve performance and ease of use. The library is written in C and supports both multi-core 
CPU and GPU architectures. 


1 Introduction 

This manuscript describes a collection of functions for computing low-rank approximations to matrices. 
In other words, given an m x n matrix A stored in RAM, we seek to compute an approximation 
Aapprox of rank k < min(m, n), represented in factored form. We consider the case where Aapprox is an 
approximate singular value decomposition (SVD), and also the case where Aapprox is represented in 
a so called “structure preserving” factorization such as the CUR or interpolative decompositions, see 
[2, 18, 13]. The problems addressed arise frequently in scientific computing, data analysis, statistics, 
and many other areas. 

Among the different factorizations, the partial singular value decomposition is known to be optimal 
in the sense that for any given rank, it results in a minimal error || A — Aapprox ||) as measured in either 
the ^ 2 -operator norm, or the Frobenius norm. The interpolative and CUR decompositions provide for 
larger than minimal error at any given rank, but preserve certain useful properties such as sparsity 
and non-negativity. 
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The algorithms used are based on randomized sampling, and are highly computationally efficient. 
In particular, the developed software aims at reduced communication cost and good scalability on 
multi-core/processor systems. 

The SVD algorithms used here were originally published in [15], were later extended in [12] and 
analyzed and surveyed in [8]. For other decompositions, we have made use of more recent results from 
[14] and [18], which were inspired by [13]. Related work is reported in [1, 17]. In our development, 
we made some modifications to previously published versions and implemented what we believe to be 
the most computationally efficient and practical algorithmic variants for use with applications. 

To introduce the idea of randomized algorithms for computing low rank approximations to matrices, 
we show in Figure 1 a basic randomized algorithm called RSVD for computing an approximation to 
the dominant k modes in a singular value decomposition (SVD) of a given matrix A. The algorithm 
shown is intended for use in the case where the rank k is much smaller than the matrix dimensions, 
k <C min(m, n). In this environment, RSVD tends to execute very fast since all interactions with 
the large matrix A happen only through the matrix-matrix multiplications on lines (2) and (4). The 
matrix-matrix factorization is a communication efficient algorithm for which highly optimized software 
is available on most computing platforms. In particular, the matrix-matrix multiplication executes 
very fast on modern multi-core CPUs and massively multi-core GPUs. All operations in the algorithm 
that are not matrix-matrix multiplications involve small matrices that have either roughly k rows or 
roughly k columns (to be precise, they have k + p rows or columns, where p is a small “over-sampling 
parameter” that we typically set to 5 or 10). 

In this paper, we describe efficient implementations of the algorithm shown in Figure 1, as well as 
some algorithms with additional features that extend the range of problems that can be handled. These 
algorithms include functions that achieve high computational efficiency in cases where the numerical 
rank of the matrix is not known in advance, and instead must be determined as part of the computation 
(given a requested tolerance). They also include variations of the basic algorithm that incur slightly 
higher computational costs, but in return produce close to optimally accurate results even for matrices 
with “noisy” entries such as, e.g., measured statistical data. (To be precise, these modified algorithms 
are designed for matrices whose singular values decay slowly.) The high computational performance 
attained by these algorithms can be largely attributed to one recurring idea: 

Key idea: Use randomization to cast as much of the computation as possible in terms of 
highly efficient matrix-matrix multiplications. 

The algorithms we discuss can readily be implemented directly in Matlab, which for many users 
may be sufficient. We remark also that in recent time, other software for randomized decompositions 
has been developed, for example, in the form of routines for R [5]; as well as codes in Fortran [16] and 
Python [9]. The C based routines in RSVD PACK are meant to be used for larger sized applications 
where computational efficiency and parallel scalability are important and include optimizations for 
multi-core processors and GPUs. The codes also incorporate some of the latest randomized methods 
for SVD, ID, and CUR computations refined by the authors and provide the possibility to use an input 
tolerance parameter instead of a fixed rank. 

The manuscript is organized as follows: Section 2 lists known facts about matrix factorizations and 
the low-rank approximation problem that we need. Some of these facts are standard results, and some 
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(1) Draw an n X {k + p) Gaussian random matrix G. G = randn(n,k+p) 

(2) Form the m x (k + p) sample matrix Y = AG. Y = A*G 

(3) Form an m x {k + p) orthonormal matrix Q such that Y = Q R. [Q, R] = qr(Y,0) 

(4) Form the {k + p) x n matrix B = Q*A. B = Q’*A 

(5) Compute the SVD of the small matrix B: B = U DV*. [Uhat, D, V] = svd(B, 'econ’) 

(6) Form the matrix U = QU. U = Q* Uhat 

(7) Truncate the trailing p terms. U = U(:,l:k); V = V(:,l:k); D = D(l:k,l:k) 

Figure 1: A randomized algorithm for computing an approximate singular value decomposition of a 
given matrix. The inputs are an m x n matrix A, a target rank k, and an over-sampling parameter p 
(the choice p = 5 is often very good). The outputs are orthonormal matrices U and V of sizes m x k 
and n X k, respectively, and a k x k diagonal matrix D such that A ~ UDV*. 


are perhaps less well known, in particular facts regarding the “structure preserving” factorizations. 
Section 3 reviews how randomized algorithms can be used to compute low-rank approximations to 
matrices, and also includes some extensions and modifications that have not previously been published. 
Section 4 describes the functionality of the RSVDPACK software. Section 5 shows the results of 
numerical experiments that illustrate the speed and accuracy of our software. Section 6 summarizes 
our key findings and discusses future work. Section 7 describes the license terms and availability of 
the software. 


2 Matrix decompositions 

This section introduces our notation, and describes the full and low rank decompositions which we 
will use. We describe the singular value decomposition (SVD), the column pivoted QR decomposition, 
the one and two sided interpolative decompositions (IDs), and the CUR decomposition. In terms of 
approximation error for the rank k decompositions, the truncated SVD is best, followed by the QR and 
ID decompositions (with identical errors) and then by the CUR. In terms of memory requirements for 
dense matrices, the two ID decompositions of A require the least space, followed by the SVD and the 
CUR. However, if A is a sparse matrix and a sparse storage format is used for the factor matrices, 
the ID and CUR decompositions can be stored more efficiently than the SVD. In the sparse case, the 
CUR storage requirement will in many cases be minimal amongst all the factorizations. The details of 
the factorizations appear in the subsections below, while the pseudocode for the algorithms to compute 
the one sided ID, two sided ID, and CUR factorizations appear in Appendix A. 

For further details, the material on the SVD is covered in most standard textbooks, e.g., [6]. The 
ID and CUR decompositions are described in further detail in, e.g., [2, 14, 18, 13, 1, 17]. 

2.1 Notation 

In what follows, we let A be a matrix with real entries. The extension to the complex case is straight¬ 
forward in principle, but our code does not yet have this capability implemented. The transpose of a 
matrix A is denoted A* to simplify the extension to complex matrices. The norms || • \\f and || • ||2 
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refer to the Frobenius and the spectral (operator (. 2 ) matrix norms, respectively. In relations where 
either matrix norm can be used, we write || • ||. For vectors, || • || refers to the usual Euclidean norm. 
By 7^(A) we refer to the set which is the range or column space of matrix A. We say that a matrix 
is orthonormal (ON) if its columns form an orthonormal set. We use the notation orth to refer to an 
unpivoted QR factorization. In other words, given a matrix A of size m x n with m > n, the matrix 
Q = orth(A) is an m X n ON matrix whose columns form an orthonormal basis for the columns of A. 
(Using Matlab notation, the operation orth can be implemented via compact QR factorization using 
the syntax [Q,~] = gr(A,0); observe that this is closely related to, but not identical to, the native 
function orth in Matlab.) We use Matlab style indexing to refer to matrix row or column extraction. 
Thus, V(1 : a, 1 : b), refers to a submatrix formed by extracting the first a rows and b columns of V. By 
Jr and Jc we denote index (integer) vectors of row and column numbers of A, corresponding to some 
particular rearrangement. We let N(0, 1) denote a normalized Gaussian probability distribution, and 
use the term GIID matrix to refer to a matrix whose entries are drawn independently from N(0,1). 
Using Matlab notation, an m x n GIID matrix is generated via R = randn{m,n)). The expectation 
of a random variable is denoted E[... ] and the variance by Var[... ]. 


2.2 The singular value decomposition 

Let A be an m X n matrix with real entries. Setting r = min(m, n), every such matrix admits a so 
called “economic singular value decomposition (SVD)” of the form 


A = U E V*, 

m X n m X r r x r r x n 


( 2 . 1 ) 


where U and V are orthonormal matrices and E is a diagonal matrix. The columns (uj)^^^ and 
{vjYj=i of U and V are called the left and right singular vectors of A, respectively, and the diagonal 
entries of E are the singular values of A. The singular values of A are ordered so that 

O'! > (12 > ■ ■ ■ > CTr > 0 . In other words. 


U = [ui U2 • • • Ur] , V = [vi V2 • • • Vr] , 


and 


E = diag((Ti, £72, ..., (Jr). 


The factorization (2.1) can be viewed as expressing A as a sum of p rank-one matrices A = YYj=i ^*j- 
In the setting of this article, we are primarily interested in the case where the singular values CTj decay 
relatively rapidly to zero, meaning that the sum converges rapidly. In this case, it is often helpful to 
approximate A using an approximation A^ ~ A defined by the truncated sum 


Afc = Vj = Ufc SIfc Ml, 

i=i 


( 2 . 2 ) 


where k is a number less than r, and 

Ufc = [ui U2 • • • Ufc] , Vfc = [vi V2 • • • Vfc I , 


and 


Efc = diag(cri, £72, ... , £7fc). 


It is well known that the truncated SVD A^ is the most accurate of all rank-A: approximations to A, 
in the following sense [4]: 
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Theorem 2.1 (Eckart-Young) Let A be an m x n matrix with singular value decomposition (2.1). 
Then for any k such that 1 < k < r, the truncated SVD as defined by (2.2) is the optimal 
approximation to A in the sense that 


||A — Afcll = inf{||A — B|| : B has rank k}, 

where || • || is either the ((^-operator norm or the Frobenius norm. The minima are given by, 
|A - Afc ||2 = o-fc+i, and || A - AfeUi? = | ^ u? 

\j=k+i 


(2.3) 


2.3 The column pivoted QR factorization and low rank approximation 

Let A be an m X n matrix with real entries as before and set r = min(m, n). The column pivoted 
QR-factorization (CPQR) of A takes the form 


A P = Q S. 

m X n n X n m x r r x n 


(2.4) 


where P is a permutation matrix, Q has orthonormal columns, and S is upper triangular. (The upper 
triangular factor is more commonly written R but we use S to avoid confusion with the factors in the 
CUR decomposition.) 

The QR factorization is commonly computed via iterative algorithms such as Gram-Schmidt or 
Householder QR [6, Sec. 5.2], which proceed via a sequence of rank-1 updates to the matrix. When 
column pivoting is used, the process can be halted after k steps to produce a rank-A: approximation 
Aapprox to A. To illustrate, suppose that we have completed k steps of the QR-factorization process, 
and partition the resulting Q and S to split off the first k columns and rows: 


r—k 


Q = m [ Qi Q2 ] , and S = 

We can write (2.4) as 

A = [Qi Q 


k 

r—k 


k n—k 

Sii S12 

0 S22 


Sii S12 

0 S22 


Ql[Sii Si 2 ]P*+ Q 2 [ 0 S 22 


=:Aa 


“remainder term” 


(2.5) 


The approximation error is now given by the following simple relation 

||A — Aapproxll = IIQ2 [0 S22] P*|| = II [0 S22] II = ||S 22 ||- 

Computing a rank k approximation via a partial QR factorization is typically much faster than com¬ 
puting a partial singular value decomposition. The price one pays is that the approximation error 
gets larger. In situations where the singular values of A exhibit robust decay, the sub-optimality is 
typically very modest [7, 2], but for certain rare matrices, substantial sub-optimality can result [10]. 
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2.4 The one-sided Interpolative Decomposition (ID) 

The one-sided interpolative decomposition can be obtained by a slight amount of post-processing of a 
partial CPQR. As a starting point, let us consider the situation (2.5) that we find ourselves in after k 
steps of the QR factorization process. For this discussion, it is convenient to represent the permutation 
matrix P using an index vector Jc G where 

Jc — ["^skeh «^res] /q 

Ixn lxfclx(n — A:)’ 

so that P = !(:, Jc), where I is the n x n identity matrix. Then AP = A(:, Jc). We can now write the 
rank k approximant Aapprox that was defined in (2.5) as 

Aapprox = Ql [Sll S12] P* = QlSii[lfc Sri'Si 2 ]P* = QlSii[h T]P*, (2.7) 

where T; is the k x (n — k) matrix resulting from solving the linear system 


SiiT; = Si2. 


Observe that Sn necessarily has rank k and is in consequence invertible. (If the rank of Sn would 
be less than k, then the exact rank of A would also be less than k and we would have halted the QR 
factorization earlier.) 

From (2.5) it follows that: 

A(:,Jc) = Ql [Sii S12] + Q2 [O S22] 

k n—k 

= m 1 ^ QiSii Q1S12-l-Q2S22 ]• ( 2 - 8 ) 

Now observe from (2.8) that the matrix QiSn consists simply of the k pivot columns that are listed 
first in Jc. We define this quantity as the m x k matrix 

C:=A(:,Jc(l:A:)) = QiSii. (2.9) 


Moreover, we define a column interpolation matrix V as the n x k matrix 



Inserting (2.9) and (2.10) into (2.7), we find the expression 

A — CV* 

^approx — * • 


( 2 . 10 ) 


( 2 . 11 ) 


Equation (2.11) is known as a column ID of rank k of A. Heuristically, the column ID identifies a 
subset of the columns of A (the k columns identified in Jc(l : k)) that serve as an approximate basis 
for the column space of the matrix. 
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The approximation error A — CV* is identical to the error in the partial column pivoted QR 
factorization, since CV* = Aapprox, where Aapprox is the approximant defined by (2.5). Consequently, 

||A - CV*|| = ||A - Aapproxll = IIS22II. ( 2 . 12 ) 


Just as a column ID can be derived by orthonormalizing the columns of A via a QR factorization, 
we can also derive a row ID by orthonormalizing the rows of A. Performing a A:-step QR factorization 
of A*, we end up with an approximate factorization 


A pa W R, 

m X n m X k k x n 


(2.13) 


where R is a /c x n matrix that consists of k of the rows of A. To be precise, 

R = A(Jr(l:A:),:), 

where Jr is the permutation vector resulting from the QR factorization of A*. 


2.5 Two sided ID and CUR Decompositions 

The matrix factorizations described in Section 2.4 use either a subset of the columns as a basis for the 
column space, or a subset of the rows as a basis for the row space. Next, we will describe the two sided 
ID and the CUR decompositions which select subsets of both the columns and the rows, to serve as 
bases for both the column and the row spaces. 

To derive the two sided ID, we start by constructing a column-1 D so that we have the approximation 

(2.11) . Next, we execute a row-ID on the tall thin matrix C, to obtain a factorization 

C = WC(Jr(l : A;),:). (2.14) 

Observe that the factorization (2.14) is exact since the rank of C is at most k. Inserting (2.14) into 

(2.11) , and observing that 

C(J,(l:fc),;) = A(Jr(l:A;),Jc(l:fc)), 

we obtain the two-sided ID 

Aapprox = W A(Jr(l:A:),Jc(l:A:)) V*, 

m X n m X k k x k k x n \ ■ J 


Observe that the rank-A; approximation Aapprox remains identical to the matrix defined in (2.5), which 
means that the two-sided ID incurs exactly the same error as the column ID, and the truncated QR 
decomposition. 

The popular CUR decomposition takes the form 


A 


CUR, 

m X k k X k k x n 


(2.16) 


m X n 
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where the matrices C and R consists of k columns and rows of A, respectively, just as in Section 2.4. 
The CUR decomposition can be obtained from the two-sided ID. As a first step, we use the index 
vector Jr and Jc in (2.15) to define 

C = A(:, Jc(l : k)), and R = A(Jr(l : fe),:). 

Then observe that since the matrix C is the same in (2.16) as it is in (2.11), we can convert the 
approximation (2.11) into a CUR decomposition if we can determine a k x k matrix U such that 

UR = V*, (2.17) 

The system (2.17) is overdetermined, and solving it typically incurs an additional error. Consequently, 
the approximation error in (2.16) is typically larger than the approximation error beyond the error 
incurred in the original QR factorization. 

2.6 Specific rank and tolerance based decompositions 

Notice that each of the discussed decompositions (SVD, ID, and CUR) can be computed either to 
a certain fixed rank A: or to a tolerance TOL. For each decomposition, the fixed rank k refers to 
the size of the product matrices in each decomposition. We can also use a tolerance to determine 
the decomposition size. In the case of the low rank SVD, given TOL, we can compute k such that 
(^k+i < TOL. The one and two sided ID decompositions are based on the pivoted QR factorization 
and the CUR decomposition is based in turn on the two sided ID. For these decompositions, given 
a parameter TOL, we can perform a sufficient number of steps in the QR factorization to obtain 
IIS22II < TOL. 

3 Randomized Algorithms 

The classical algorithms for the factorizations discussed in Section 2 may be too costly for matrices 
with a large memory footprint. However, the algorithms to obtain all the factorizations we have 
discussed; the low rank SVD, the ID, and CUR factorizations, can be substantially accelerated by 
means of randomized sampling, with relatively small accuracy tradeoffs [8]. 

The factors Ufc, and in the partial SVD of a matrix A, cf. (2.2), can be computed by 
constructing the full SVD of A using standard software routines such as, e.g., those available in a 
LAPACK implementation, and truncating. However, this is quite costly, with an asymptotic cost of 
0{mnr) where r = min(m, n). In contrast, all techniques presented here have an asymptotic cost of 
0{mnk), which represents a substantial savings when k <C min(m, n). Notice that both the two sided 
ID and CUR factorizations can be computed efficiently once a single sided ID has been computed, so we 
concern our discussion and analysis on using randomized sampling for computing the rank k SVD and 
the rank k ID decompositions. 

The idea behind randomized algorithms for constructing low rank approximations to a matrix is to 
apply the desired factorization to a smaller matrix, derived from the original matrix. We first discuss 
the use of randomization for constructing an approximate low rank SVD, which we elaborate more on 


in 3.4. Given A G we can construct samples of the column space of A via the computation 

Y = AJ2 with n an n X {k + p) Gaussian random matrix so that Y is of size m x (k + p) with k 
the desired rank and p a small oversampling parameter, the use of which considerably improves the 
approximation error. We can then construct a matrix with orthonormal columns (ON) Q of size 
m X (k + p) via a QR factorization of Y. If Y captures a good portion of the range of A (assuming k is 
sufficiently large relative to the numerical rank of A), then we expect that QQ*A ~ A. The idea is to 
compute the factorization of the smaller (k+p) x n product matrix Q*A and then multiply the result 
by Q to obtain an approximation of A. Note that while Q*Q = I, the matrix product QQ* multiplies 
out to the identity only in the case that Q is a square orthogonal matrix. The product P = QQ* is a 
projector onto the range of Q and hence onto that of Y (where we assume that Q is obtained via a 
compact QR factorization of Y). To verify this, it is easy to show that = P and for any v e no), 
Pv = V. In the case that Y (and hence Q obtained from Y) captures the entire range of A, we have 
equality of QQ*A and A, per the lemma below: 

Lemma 3.1 Let A G q ^ -^mxr orthonormal matrix. Then the following are 

equivalent: 

( 1 ) 7^(A) C 7^(Q) 

(2) A = QQ*A 


Proof. Assume (1) holds. Then this implies that there exists a matrix S G such that A = QS. 

It follows that: 

QQ*A = QQ*QS = QS = A, 

since Q*Q = /. Hence, (1) =i> (2). Next, assume (2) holds. Then: 

A = QQ*A = Q (Q*A) ^ 7^(A) C 7^(Q). 

Hence, (2) => (1). □ 


An extension of lemma 3.1 states that when 7^(Q) is close to 7^(A) (that is, when Q captures much 
of the range of A), then QQ*A ~ A. To make this statement more precise, we summarize here some 
results from [8]. Suppose we take a GHD (Gaussian independent identically distributed) matrix Q of 
size n X I, where I = k + p with k being the rank of the approximation we seek and p being a small 
oversampling parameter. We may split the SVD of A as: 


k 

n—k 


n 

\ 2:1 

0 

k 


0 

_ 1 

^2 . 

n—k 

L ''2 J 


Now let f2i = and Q .2 

corresponding ON matrix as: 


[u] 


= V^f2. We may use these to write the sample matrix Y and the 


Af2 


[u] 


k 

n—k 


Zifii 

^ 2^22 
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Q = orth(A). 








Since 7Z(Q) = 7Z(Y) (orth is implemented via a compact QR factorization), the projector onto the 
range of Y, Py, is equivalent to the projector onto the range of Q. Hence, Py = QQ* and A —QQ*A = 
A — PyA = (I — Py)A. In [8], it is shown that if Qi has full row rank then the approximation error 
satisfies: 

||(l-Py)A|| < \\T2f + \\Z2n2^\f, (3.1) 

in both the spectral and Frobenius norms. In (3.1), when A has precisely rank k, then Z 2 = 0 and 
the right hand side is zero so that QQ*A = A and so Y will capture the range of A. As k approaches 
the rank of A, Y will capture more and more of 'l^(A) and the matrix product QQ*A will approach 
A. If we have an ON matrix Q such that: 


||(I-QQ*)A|| <e. 


(3.2) 


then we can perform a factorization of Q*A which is of size {k + p) x n and multiply by Q to obtain 
an approximate factorization of A with the same approximation error (in the same norm) as indicated 
by (3.2). Assuming k <C min(m, n), the matrix Q*A is substantially smaller than A. For example, we 
can perform the SVD (of full rank A: + p) of B = Q* A and then multiply by Q to get an approximate 
SVD of A: 

B = 0dV* ^ ARi(Q0)DV* 


Notice that instead of performing the SVD of B, we can also perform e.g. a pivoted QR factorization 
of B to get an approximate pivoted QR factorization of A: 


BP = QR ^ AP Ri (QQ)R 


For either approximation, the bound ||(l — QQ*)A|| is the error bound between the obtained factor¬ 
ization and the original matrix, since the factorization of B = Q*A is exact. In the software we 
provide, we implement two types of randomized routines: plain randomized and block randomized. 
The plain randomized routines form Q out of the matrix of samples Y using a single QR factorization 
of the matrix Y = AQ (or (AA* )'^ A, as we discuss in 3.2). This can involve operations with large 
matrices depending on the desired rank k. The block randomized routines form the matrices Q and B 
in a blocked fashion using multiple QR factorizations and matrix multiplications of smaller matrices 
controlled by a block size parameter, by employing the iterative procedure from [14]. 


3.1 Column norm preservation 


Another useful aspect of randomized sampling concerns the preservation of column norm variations 
in a matrix. Suppose A is m x n and we draw an I x m GIID matrix fi. Suppose we then form the 
I X n matrix Z = QA. We can then derive the following expectation result relating the column norms 
of A and Z: 


Z(:V-)in 
A(:,j)Pj ■ 


(3.3) 


The result follows from the following lemma, where we make use of the construction mentioned in [3] . 
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Lemma 3.2 Let Q G be a matrix with GIID entries. Then for any a G M™' we have that 

= I and Var[Jgp] = 21. 


<2 

Proof. Since has GIID entries, Qij N(0, 1) and hence, ||n(:,j)|p = being the sum of 

squares of m iid Gaussian random variables is distributed as Xm (Ghi-squared distribution of degree 
m). It follows that -E[||nej|p] = m. Note that if Q = ig an orthogonal matrix (that, is 

QQ* = Q*Q = I) , then the matrix W = QQ which islxn, is also GIID. This means that Wjj ~ N(0,1). 

i 

It follows that ||W(.= S'"! ~ xf. Hence, E [||W(, = 1. Now let Uq be the unit vector 

i=l 

corresponding to a, that is Ua = ||f|| and let Q_l G be a complement giving a full orthogonal 

basis so that Q = [ua Q±] is an orthogonal matrix. Since Q Q = /, we must have: 


Q*Q 





<Q± 


1 0 

QlUa 

QIQ± 


0 lm-1 


It follows that: 


u*a 


u*a 


a 2 


1 

a 

Qla 

— 

a 

0 

— 

0 

= a 2^1 where ei = 

0 


Finally, we look at the quotient The numerator Qa = QQQ a = WQ a where W = QQ is 

GIID since Q is orthogonal and Q a = ||a|| 2 ei. Thus, Qa = WQ a = W||a|| 2 ei = ||a|| 2 W(. It 
follows that: 


ll^ll 


2 


2 


|W 


(G)l 


Xi 




= 21 . 


□ 


From Lemma 3.2, with a = Aej = A(:,j), relation (3.3) follows. By virtue of this relation, we have 
that 

F[||Z(:,i)-Z(:,j)f] =n|A(:,i)-A(:,j)f, 

so that the smaller matrix Z = S2A derived from A preserves the column norm variations of the original 
matrix A, up to a constant multiple dependent on /. At least nominally, this explains why the pivoting 
matrix in a pivoted QR decomposition of Z is expected to also work for the larger A, which provides 
some motivation for the randomized ID algorithm which we later present. 


3.2 Power sampling scheme 

In our previous discussion in 3, the matrix Q which we use to project the matrix A into a lower 
dimensional space is constructed via compact QR factorization of Y: 

Y = Af2 ^ Q = orth(Y). 


II 


























We now describe a so called power sampling scheme, which improves the error bound on the low 
rank approximation when the tail singular values (i.e. cr^+i, ■ ■ ■, <7^) are signihcant. It is also helpful 
in cases where a matrix A with singular values ai,... ,ar has one or more large sequence of singular 
values CTq,... ,(Ti^ with 1 < ii < ip < r, where the singular values decrease slowly in magnitude. In 
such cases, it substantially helps to use a power scheme when sampling the range of the matrix A. 
Instead of forming AQ, we form the matrix Y = ((AA*)'?A) Q., where (; > 1 is an integer parameter. We 
then set Q = orth(Y). Note that A and (AA*)'?A have the same eigenvectors and related eigenvalues. 
Plugging in the SVD A = UZV*, we have: 

AA* = UZ^U* ^ (AA*)2 = UZ^U^UZ^U* = UZ^U* ^ (AA*)'^ = UZ^'^U* 

^ (AA*)'?A = UZ^‘'U*UZV* = UZ^'^+^V* 

When A is such that its trailing singular values decay slowly, the matrix (AA*)'^A with g > 1 has 
significantly faster rate of singular value decay. 

The matrix Z = (AA*)'?An can be built up by means of the following iterative procedure: 


(1) Y = Afi 

(2) for i = 1 -. q 

(3) Y ^ A*Y 

(4) Y ^ AY 

(5) end 

In practice, we may wish to orthonormalize before multiplications with A to prevent repeatedly mul¬ 
tiplying matrices having singular values greater than one. In cases where very high computational 

1 

precision is required (higher than where Cmach is the machine precision [14]), one typically needs 

to orthonormalize the sampling matrix in between each multiplication, resulting in the scheme: 


(1) Y = Afi 

(2) for i = I ■. q 

(3) Y ^ A*orth(Y) 

(4) Y ^ Aorth(Y) 

(5) end 

In many cases, orth does not need to be performed twice at each iteration. In the software, we use a 
parameter s which controls how often the orthogonalization is done (s = 1 corresponds to the above 
case, s = 2 corresponds to doing one orth operation per iteration, and greater values for s correspond 
to correspondingly less frequent orthogonalization). In the randomized algorithm we present for the 
ID computation, we multiply by a GIID matrix from the left. A similar power scheme in that case 
(which we later discuss) also offers similar benehts. 


12 


3.3 Adaptive rank approximation algorithms 

A major challenge in constructing suitable low rank approximations via randomized schemes is in the 
construction of an ON matrix Q such that QQ*A« A, since the corresponding decomposition can then 
be formed of the smaller matrix Q*A. The simplest approach to follow is to do as discussed before, 
forming Y = AS2 (we do not mention here the use of the performance improving power sampling 
scheme, to which we will return momentarily). The problem is that without advanced knowledge of 
the singular value distribution of A, it is hard to guess an optimal rank k parameter in the size of 
the n X {k + p) matrix Q.. If /c is chosen too small relative to the numerical rank of A, then given 
Q = orth(Y), obtained via a compact QR factorization of Y, QQ*A will not be close to A. If instead 
k is chosen too large, then the matrix B = Q*A, once computed (itself a possibly overly expensive 
operation because of the QR on needlessly large Y) will be large and not much time and memory 
savings could be obtained by performing the wanted factorization of B instead of the original A. 

One way to proceed, is to start with a small Y, and then increase the size of Y by a block of 
samples at a time (by appending to Y a matrix Ygm = Angm where f2sm is a small n x kstep random 
Gaussian matrix). The new enlarged matrix Y = [Y,Ysjn] can then be orthogonalized to obtain a 
larger Q using e.g. Q = gr(Y,0). One can stop when the generated Q becomes large enough so that 
||QQ*A-A|| becomes sufficiently small. The main problem with this approach is its inefficiency, given 
that the QR factorization needs to be applied repeatedly to an increasingly larger Y. 

We now discuss two more efficient algorithms for the automatic construction of suitable matrices 
Q and B = Q*A given a parameter e > 0. The algorithms are presented side by side in Figure 2, 
taken from [14]. The first algorithm QBi is the single vector version, where Q is built up a column at 
a time, and the second algorithm QB 2 is a blocked method, where Q is built up more rapidly, using 
blocks of vectors at each step. In the second algorithm, the power scheme is also employed. For both 
methods, on exit, we have that (3.2) holds. Let us first analyze the simpler single vector method. 

Lemma 3.3 At the end of iteration j of Algorithm QBi, we have: 

A(^) = (I - QjQ*)A and Bj = Q*A (3.4) 

Proof. The results can be established by induction. Notice first that ||qj|| = 1 = for all j. 

Initially, q^ G "^(A), Qi = [qj^j, Bi = [bi] = [q|A] = Q][A. After the first iteration, A^^^ = A*^°^ — 
q^bi = A — qiqiA = (I — QiQi)A. Next, q 2 G 7^(A*-^^) = 7^ ((I — q]^q];)A) G Tl{\ — qiqi) which implies 
q 2 = (I — qiq|)/x for some vector jj,. Thus, q^qi = fi*{\ — qiq];)qi = 0. It follows that: 

A(2) = - q2b2 = (I - q,ql)A - - qiq^)A = A - q^q^A - q2q*2A = (I - Q2Q^)A. 

In general, we have that (I — q^qp T and that q^ T q^ for i ^ j- Let us assume that 

A^-^) = (I — QjQpA. It follows that: 

= AO) - qy^„q;.^„AW = (I - q„+.,q;,„,)A«) = (I - q„+i,qo+„)(l - Q.Q’jA 
= (I — QjQj — q(j_|_i)q(j_,_]^))A = (I — Qj+iQ*^;^)A 
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Similarly, if we assume Bj = QjA, then we have: 


□ 


B 


(j+i) 


Q*A 




aO') 


r Q*A 1 



“ QiQ]l)A - q(j+i)A_ 




A = Q5.+,)A 


Notice that we quit Algorithm QBi precisely when ||A^-^^|| < e is small, so since we have shown that 
= (I — QjQ^A, we have on exit that (3.2) holds. 

By a similar argument, it is proved in [14] that on output, the quantities in (3.4) hold for the 
blocked scheme QB 2 . The blocked algorithm is substantially accelerated over the non-blocked version, 
since Q is amended at each iteration by a block of column vectors. The version we implement in 


Algorithm QBi 


(1) 

Qo = [ ]; Bo = [ ]; 

0 
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- T bj 

(9) 

(9) 

end while 



(10) 
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k = j. 



(11) 





(12) 
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Algorithm QB 2 
for i = 1, 2, 3, ... 

Sli = randn(n, b). 

Qi = orth(A£2,). 
for j = 1 : q 

Qi = orth(A*Qi). 

Qi = orth(AQi). 
end for 

Qi = orth(Qi - Q,Q*Qd 

= Q*A 

A = A-Q,Bi 
if |A I < s then stop 
end while 

Set Q = [Qi ■ ■ ■ Qi] and B = [Bt • ■ • B*]*. 


Figure 2: Single vector and blocked algorithms for the adaptive construction of Q. 

RSVDPACK is based on this scheme, giving a blocked algorithm for the construction of matrices Q 
and B, using a supplied tolerance e > 0. Here, we use the bar notation to represent a collection of 
block matrices, where Qi — [Ql) • ■ ■ ) Qi] and Bj = [B^;...; B*]*. Pseudocode corresponding to scheme 
QB 2 is shown in the appendix as Algorithm 7. Since the algorithm is blocked, the orth operation is 
performed only on matrices of b columns, with b being the specified block size. The parameters M and 
e control how many blocks are used: either the maximum specihed by M or when Q composed of a 
certain number of blocks is large enough so that jjQQ*A — Ajj becomes sufficiently small. Notice that 
the key to Algorithm 7 is to update the original matrix A. As we discuss below, this can be avoided 
if desired with some accuracy tradeoffs. In Algorithm 7, if we assume Q^ Qj = 0 for i ^ j (which is 
proved in [14]) and plug in B* = Q*A^*\ then at the (i)-th iteration, we obtain the recursive relations: 

A« = (I - Q,Q*)A(*-i) = (I - QiQ:)(l - Qi_iQ:_i)A(*-2) = (I - Q,Q* - Qi_iQ*_i)A('-2) 

= =(i-q,q:)a 
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Thus, this implies that on line (10) of Algorithm 7, we check the condition ||(l — QjQ*)A|| < e, so 
that if we make M large enough, we quit the loop precisely when QjQ*A ~ A as in (3.2). The 
reorthonormalization procedure on line ( 8 ) of Algorithm 7 is necessary to avoid loss of accuracy, but 
does not necessarily need to be performed at each iteration. 

In practice, the blocked scheme in Algorithm 7 can often be substantially simplified and still yield 
accurate results, particularly when A has non-linear decay of its singular values. The main drawback 
of the scheme is that the algorithm is essentially sequential, although each individual matrix-matrix 
operation can of course take advantage of parallel processing. Another disadvantage is that Algorithm 
7 requires the explicit updating of the matrix A, which requires either to modify A or make a copy of 
A as another variable, which may not be desirable when A is large. Algorithm 8 is a version which 
consists of several for loops, the first two of which can be performed in parallel. This version also 
never requires the orth operation to be performed on a large matrix and never requires the updating 
of A. In place of a big QR, we perform a projection operation on line (13), which in practice, has the 
tendency to keep the columns of Q almost orthonormal with respect to each other. Algorithm 7 is 
more accurate than Algorithm 8 and has the advantage that it can satisfy (3.2) to a given e tolerance, 
but when the singular values of A decay rapidly, both methods give plausible results. 

For very large matrices, we can make use of a blocked scheme. We can proceed by subdividing A 
into blocks along the rows. We assume here that the number of blocks is a power of two. Without 
loss of generality, we assume the use of four blocks: 


Ai 


QiBi 


1 

0 

0 

0 

6 

1_ 


Bi' 

A2 


Q2B2 


0 

0 

a 

0 


B2 

A3 


Q3B3 


0 0 Qs 0 


B3 

A4 


Q4B4 


a 

0 

0 

0 

_ 1 


B4 


We then perform QB factorizations on the blocks of the B matrix: 




B 2 


Q 12 B 12 


M( 2 ) 


B 3 

B 4 


Q34B34 


Finally, we perform a QB factorization on: 

m(3) = 

It follows that: 


Bi 2 

B34 


Q1234B 


1234D1234 


- 1 

9 

0 

0 

1 

0 




0 

1 _ 

0 

0 

1 - 

0 

0 

Q2 

0 

0 


Qi2 

0 ■ 

Bi2 


0 

Q2 

0 

0 

0 

0 

Q3 

0 


0 

0 

CO 

4 ^ 

B34 


0 

0 

Q3 

0 

1- 

0 

0 

0 

Q4 




0 

0 

0 

Q4 


= Q(3)q(2)q(1)b(1) = QB 


Qi 2 0 
0 Q 34 


Q1234B 


1234D1234 


The benefit of this formulation is that the QB algorithm can be performed on smaller matrices in par¬ 
allel. In particular, we handle the decompositions of blocks Ai,..., A 4 in parallel, following which we 
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can do in parallel the decompositions of matrices M and and finally that of The reason 

for blocking A along the rows instead of columns is that we would like to keep the orthonormality of 
the resulting matrix Q which is done by working with block diagonal matrices. Notice that the later 
factorizations of smaller matrices, can be replaced by full rank (library routine callable) QR factoriza¬ 
tions in place of low rank QBs. This hierarchical procedure is not implemented in RSVDPACK but is 
being explored as part of upcoming work. 

3.4 Randomized algorithms for the low rank SVD 

For computing the low rank SVD of rank k of matrix A G the basic randomized algorithm 

consists of the following steps: 

• Form GIID Q G with I = k + p. 

• Form sample matrix Y of size m x I via Y = Af2. 

• Orthogonalize this set of samples forming the matrix Q = orth(Y). 

• Project the original matrix into a lower dimensional one; B = Q*A, where B is / xn, substantially 
smaller than A, which is m x n. 

• Compute the SVD of the smaller matrix B = UZV*. 

• Form U = QU. 

• Form the component matrices of the approximate rank-A: SVD of A by setting; 

Ufc = U(:, 1 : k),Tk = T{l:k,l: k),Mk = V(:, 1 : k), 
so that the product UfcZfcV^ ps A. 

The first modification of the original algorithm computes Ufc and Vfc without having to take the SVD of 
the I X n matrix B, which may still be large if n (the number of columns of A) is large. Instead of 
the SVD, we can use the eigendecomposition of the smaller k x k matrix BB*. For this, we use the 
relations; 


B = UfcZfcV^ = ^c7iQ,v: ; B* = VfcZfcO^ ; Bv, = ^,0^ 


2=1 




\i=l 


vi=l 




2 = 1 


UfcDfcU 


* 


where Zfc = \/Dfc element wise. To compute the right eigenvectors Vj, we can use the following 
relations, followed by truncation to the k dominant components: 

B*Ufc = VfcZfcUfcUfc = VfcZfc ^ Vfc = B*0fcZ^\ 
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assuming all the singular values in are above zero (which is the case for k smaller than the numerical 
rank of A). The eigendecomposition of the symmetric matrix BB* can be easily performed with a 
library call or hand coded Lanczos routine. Since this matrix is small {I x 1), the eigendecomposition 
is not expensive. On the other hand, a simple yet possibly time and memory consuming step when 
B is large, is to carry out the matrix matrix multiplication BB*. When B is large, this step can be 
done in a column by column fashion via multiplication with standard basis vectors B(B*ej). A similar 
approach in this case is useful for obtaining the columns of V^. A disadvantage of using this method 
is that the matrix BB* has essentially the square of the condition number of B. As a result, very small 
singular values of A near machine precision may not be properly resolved. This is an issue only if A 
is expected to have very small singular values amongst ai,... ,ak- 

Another approach is to use a QR factorization of B* instead of forming BB*. Out of this we get a 
I X I matrix R on which we perform the SVD, instead of performing the SVD on the I x n matrix B. This 
version is based on the following calculations. Let us take the (economic form of the) QR factorization 
B* = QR. Then R is / x / and taking the SVD yields R = UZV . 

A Ri QQ*A = QB = QR*Q* = QVZU*Q* 

Thus, the low rank SVD components are Ufc = (QV) (;, 1 : /c), Z^ = Z(1 : k,l : k), = (QU) (:, 1 : k). 
For both variations of the algorithms, the power scheme with sampling matrix (AA*)'^A may be 
used to improve performance for matrices with significant tail singular values. Notice that in many 
software packages, the order of the returned eigenvalues in an eigendecomposition is opposite to that 
of the singular values in an SVD. For this reason, following an eigendecomposition call, instead of 
the first k components, the last k components must sometimes be extracted (for example, in Matlab) 
to correspond to the extraction of the most significant (in absolute magnitude) components. The 
pseudocode for the two methods is provided in the Appendix as algorithms 4 and 5. 

For either method, the upper bound on the approximation error [8] with the randomized low rank 
SVD algorithm can be large, 

||A - UfcZfcVfc||2 < V^cTfc+i, 

with respect to the optimal ak+i (in the spectral norm) given in Theorem 2.1. However, if we incor¬ 
porate the use of the power samping scheme, described in 3.2, then the approximate upper bound [8] 
improves to: 

||A - UfcZfcVfc||2 < (A:n)2(2<?+i)o-fc+i. 

For sufficiently large q, this is substantially closer to the optimal value of (Jk+i in Theorem 2.1. 
Numerical results with power sampling (using q > 2) are often very close to optimal, as illustrated in 
Section 5. 

As previously mentioned in 3.3, if a QB decomposition of A to tolerance e is obtained, such that 
IIQB - A II < e (using Algorithm QB 2 , for instance), then following one of the above procedures 
for computing the low rank SVD, we would have the approximation satisfy the same error bound: 
||UfcZfcV^-A|| < e. This is a very useful property of the QB decomposition and of the corresponding 
block randomized routines in RSVDPACK. 
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3.5 Randomized algorithms for the ID 

We now discuss the use of randomized sampling to speed up the ID computation. Observe that in 
order to compute the column ID of a matrix, all we need is to know the linear dependencies among the 
columns of A. When the singular values of A decay reasonably rapidly, we can determine these linear 
dependencies by processing a matrix Y of size i x n, where i can be much smaller than n, similar 
to what we did in the randomized algorithm for computing the low rank SVD. In the randomized 
scheme for the ID, we perform the partial pivoted QR factorization of the smaller Y rather than of 
A. Mathematical justification for this is given by the relation between the column norms of Y = S2A 
and of A in (3.3). That is, the column norm variations of A are preserved by Y. (Notice that as 
discussed in section 3, by Lemma 3.2, for any i,j between 1 and n, if we take a difference of two 
columns of A as x = A(:,i) — A(:,j) and y = Qx = Y(:,i) — it follows that E [||y|p] = /||x|p). 

Suppose that we are given an m x n matrix A and seek to compute a column ID, a two-sided ID, or 
a CUR decomposition. We can perform this task as long as we can identify an index vector Jc as in 
(2.6) and a basis matrix V ^ £nxk 

A = A(:,Jskei) V* + E 
m X n m X k k x n m x n 


where E is small. In Algorithm 1 we found Jc and V by performing a column pivoted QR factorization 
of A. In order to do this via randomized sampling, we proceed as above. First, we fix a small 
over-sampling parameter p, then draw a {k + p) x m GIID matrix fi, and form the sampling matrix 


Y = n A. 

(k + p) X n (k + p) X m m x n 


(3.5) 


We assume based on (3.3), that the space spanned by the rows of Y contains the dominant k right 
singular vectors of A to high accuracy. This is precisely the property we need in order to find both 
the vector J and the basis matrix V. All we need to do is to perform k steps of a column pivoted QR 
factorization of the sample matrix to form a partial QR factorization 

Y(:,Je) PS Q S. 

{k + p) X n (k + p) X k kxn 

Then compute the matrix of expansion coefficients via T = S(1 : fc, 1 : A:)“^S(1 : k,{k + 1) : n), or 
a stabilized version, as described in Section 2.3. The matrix V is formed from T as before, resulting 
in Algorithm 6. The asymptotic cost of Algorithm 6 is 0{mnk), like that of the randomized low 
rank SVD. However, substantial practical gain is achieved due to the fact that the matrix-matrix 
multiplication is much faster than a column-pivoted QR factorization. This effect gets particularly 
pronounced when a matrix is very large and is stored either out-of-core, or on a distributed memory 
machine. 

Finally, we notice that the power sampling scheme used for the low rank SVD, is also equally 
effective for the randomized ID scheme. In this case, since we multiply by Q on the left, we use the 
sampling matrix: 

Y = nA(A*A)T (3.6) 
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In cases where very high computational precision is required (higher than where Cmach is 

the machine precision [18]), one typically needs to orthonormalize the sampling matrix in between 
multiplications, resulting in: 


(1) Y = 

(2) for i = 1 : q 

(3) Y ^ (Aorth(Y*))* 

(4) Y ^ (A*orth(Y*))* 

(5) end 


where as before, orth refers to orthonormalization of the columns, without pivoting. The randomized 
algorithm for the ID appears as Algorithm 6 in the appendix. The error for the randomized ID ap¬ 
proximation is lower bounded by the error in the truncated pivoted QR factorization and typically 
stays reasonably close to this value when the power sampling scheme with > 1 is employed (see [18] 
and section 5). 

Analogously to the SVD, we can compute the approximate rank k ID given matrix B = Q*A, which 
allows us to use the block algorithms discussed in 3.3, for the construction of Q and B in computing 
the ID. We can compute the ID of B = Q*A to obtain: 




and then since, QB « A, it follows from multiplication of both sides by Q that A(:, Jc) ~ A(:, Jc(l : 
A:))S*. Notice, however, that in contrast to the case of the low rank SVD, if a QB decomposition to 
tolerance e is obtained, it does not imply that the resulting ID obtained from B will approximate A 
to the same tolerance. The error is in practice larger. From [8], (3.6), we have the bound: 


A(:,Jc(l:A:))V*-A|| < 


1 -|- ^1-1- 4A:(n — k) e 


for the ID obtained from B = Q*A. 


3.6 Randomized algorithms for the CUR decomposition 

Once an approximate ID is obtained with a randomized scheme, the randomized algorithm for the 
CUR proceeds as described in Section 2.5, using the results of the two sided ID factorization. Notice that 
to form an approximate two sided ID only one application of the randomized ID method is necessary. 
The subsequent ID is of a small matrix (see (2.14)), and does not need to employ randomization to 
retain efficiency. The error is again lower bounded by the truncated pivoted QR factorization of the 
same rank. We illustrate some examples in Section 5. 


4 Developed Software 

In this section, we describe the developed software which has been written to implement the random¬ 
ized algorithms for the computation of the low rank SVD, ID, and CUR routines. We have developed 
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codes for multi-core and GPU architectures. In each case, we have used well known software libraries to 
implement BLAS and certain LAPACK routines and write wrappers for various BLAS and LAPACK 
operations (e.g. vector manipulation, matrix multiplication, QR, eigendecomposition, and SVD oper¬ 
ations). The codes are written using the C programming language and are built on top of the Intel 
MKL, NVIDIA cuBLAS, and CUBA libraries. Since we created wrappers for most of the required 
matrix and vector functions, it is not difficult to port the code to use other libraries for BLAS and 
LAPACK. The codes use OpenMP, where possible, to speed up matrix-vector operations on multi-core 
systems. Each code can load a matrix from disk stored using the following simple binary format for 
dense matrices: 

1 nuni_rows (int) 

2 num_columns (int) 

3 nnz (double) 

4 ... 

5 nnz (double) 

where the nonzeros are listed in the order of a double loop over the rows and columns of the matrix. 
Note that even zero values are written in this format. It is not difficult to extend the codes to support 
arbitrary matrix formats, including those used for sparse matrices. We expect to add this functionality 
in future releases. The matrix can be loaded using the supplied function: 

1 matrix_load_from_binary_file(char *fname) 

Below, we list the main available functions for SVD, ID, and CUR computations. These routines 
can be used from simple C driver programs. Example driver programs are provided for illustration 
with the source code. We also provide a mex file interface for some of these routines to use inside 
Matlab. 

// low rank SVD 

low_rank_svd_deconip_fixed_rank_or_prec(mat A, int k, double TOL, 
int *frank, mat **U, mat mat **V); 

low_rank_svd_rand_decomp_fixed_rank(mat *A, int k, int p, int vnum, 
int q, int s, mat mat mat **V); 

low_rank_svd_blockrand_decomp_fixed_rank_or_prec(mat *A, int k, int p, double TDL, 
int vnum, int kstep, int q, int s, int *frank, mat mat mat **V); 

// one sided ID 

id_decomp_fixed_rank_or_prec(mat *A, int k, double TOL, int *frank, vec mat **T) 

id_rand_decomp_fixed_rank(mat *A, int k, int p, int q, int s, vec mat **T); 

id_blockrand_decomp_fixed_rank_or_prec(mat *A, int k, int p, double TOL, 
int kstep, int q, int s, int *frank, vec **I, mat **T); 

// two sided ID 

id_two_sided_decomp_fixed_rank_or_prec(mat *A, int k, double TOL, 
int *frank, vec **Icol, vec **Irow, mat **T, mat **S); 


20 



id_two_sided_rand_decomp_fixed_rank(mat *A, int k, int p, int q, int s, 
vec **Icol, vec **Irow, mat **T, mat **S); 
id_two_sided_blockrand_decomp_fixed_rank_or_prec(mat *A, int k, int p, double TOL, 
int kstep, int q, int s, int *frank, vec **Icol, vec **Irow, mat **T, mat **S); 

// CUR 

cur_decomp_fixed_rank_or_prec(mat *A, int k, double TOL, 
int *frank, mat **C, mat =i=*U, mat **R) ; 
cur_rand_decomp_fixed_rank(mat *A, int k, int p, int q, int s, 
mat **C, mat mat **R); 

cur_blockrand_decomp_fixed_rank_or_prec(mat *A, int k, int p, double TOL, 
int kstep, int q, int s, int *frank, mat **C, mat **U, mat **R); 

We now describe the functions and their parameters. The functions for SVD, ID, and CUR have 
a similar calling sequence. We provide a routine which computes the full truncated decomposition, 
as well as routines for randomized and block randomized approximate versions, either to a fixed rank 
or to a specified tolerance level. We now describe in detail the routines for the low rank SVD. The 
function: 

1 low_rank_svd_decomp_fixed_rank_or_prec(mat *A, int k, double TOL, 

2 int ^frank , mat **U, mat **S, mat ^^V); 

computes, using the full SVD of A, either the rank k low rank SVD of A by truncating to the first 
k components or computes the low rank SVD to given precision TOL such that the singular value 
CTfc_|_i < TOL. The routine returns a fixed rank result if the parameter A: > 0 is supplied or computes 
the desired rank if A; < 0 and TOL is supplied instead. The output rank is then written in frank. The 
output matrices U, S, V contain the components of the computed low rank SVD. The routine: 

1 low_rank_svd_rand_decomp_fixed_rank(mat *A, int k, int p, int vnum, 

2 int q, int s, mat **U, mat **S, mat **V); 

computes the low rank SVD of rank k using randomized sampling. Here the parameter vnum (short 
for version number), indicates which routine to use for computation, either the method using the 
BB* matrix, or the QR decomposition method, as discussed in Section 3.4 (vnum equal to 1 for the 
QR method or to 2 for BB* method). The parameter p corresponds to oversampling (usually a small 
number 5 < p < 20 relative to k). Parameters q and s correspond to the power sampling scheme with 
q being the power and s controlling the amount of orthogonalizations (s = 1 corresponds to the most 
orthogonalizations after each multiplication with A). Typically, one uses 1 < g < 5. Next, the routine: 

1 low_rank_svd_blockrand_decomp_fixed_rank_or_prec(mat ^A, int k, int p, 

2 double TOL, int vnum, int kstep, int q, int s, 

3 int *frank, mat **U, mat **S, mat **V); 

uses the block randomized algorithm to construct the approximate QB decomposition of A, and from 
that construct the low rank SVD, as explained in Section 3.4. The decomposition is computed either 
to specific rank ss (k + p) using a computed number of blocks (ss ifstep ) ^^^h of size kstep, or is 
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computed such that ||A — QB|| < TOL if A: = 0 is supplied. Parameter vnum again corresponds to 
the routine to use on B as above and q, s correspond to the power sampling scheme. If rank k > 0 
is supplied, the resulting decomposition components U,S,\/ are truncated to rank k. Otherwise, in 
TOL model, k and the output parameter frank is set to the output size of B (which depends on the 
supplied tolerance TOL). 

For the ID and CUR routines, TOL mode is based on (2.5). When A; > 0 is not satisfied, the tolerance 
based algorithms determine the QR decomposition rank such that IIS 22 II < TOL. In the block rand 
scheme, the fraction is used to determine the truncation factor of the pivoted QR decomposition 
of B, whose size depends on the supplied tolerance TOL. The output size of the QR components is 
written to frank. 


5 Performance Comparisons 

We now present some performance comparisons. For our tests, we form m x n matrices of the form 
A = U DV* , where U and V are randomly drawn matrices with orthonormal columns and where D is a 
diagonal matrix with the desired singular value distribution. We use different logspaced distributions 
of singular values, which in Matlab notation are generated using the commands logspace(0, —0.5,r), 
logspace(0, —2, r), and logspace(0, —3.5, r) with r = min(m,n). We refer to the resulting matrices 
as types I, II, and III. First, we present some comparisons regarding the accuracy of the low rank 
SVD and ID computations obtained with different powers of the power scheme for the three types of 
matrices. In Figure 3, we plot the approximation errors vs k obtained using the RSVD 

algorithm (with QR method) with different powers q for the power sampling scheme. We observe that 
the use of the power scheme iterations {q > 1), gives substantially closer to optimal results than q = 0. 
On the left, we present an example where the singular values fall off relatively slowly, and as a result, 
each increase in q has a noticeable effect. In particular, (7 = 2 is sufficient for a good approximation 
and gives better performance than q = 1. For matrices with rapidly decreasing singular values (type 
III) on the right of Figure 3, we observe that g = 1 is sufficient for a good approximation. 




Figure 3: Approximation errors vs rank k obtained via the RSVD algorithm with g = 0,1, 2 for better 
and worse conditioned matrices. 
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Next, we compare the approximation errors we obtain for the different low rank approximation 
algorithms in Figure 4 for matrices of type II and III. Recall that based on our discussion in Section 
2, for a fixed rank k, the QR and single and two sided ID decompositions share the same error term 
(larger than that of the optimal truncated SVD), while the CUR adds a bit of additional error. When 
we use randomized algorithms to compute the approximate decompositions, the CUR often yields 
slightly better results than the approximate ID. We also observe in Figure 4 that for matrices with 
relatively slow singular value decay, the randomized ID and CUR algorithms give poor approximations 
relative to the low rank SVD. On the other hand, for more rapid singular value decay, the ID and 
CUR approximations give better results. 



rank k rank k 

Figure 4: Approximation errors vs rank k obtained via the RSVD, RID, and RCUR algorithms with 
q = 2 for better and worse conditioned matrices. 

However, the low rank ID and CUR approximations may require less storage space than the low 
rank SVD. In Figure 5, we plot the number of nonzeros versus rank k in the matrices of the low rank 
factorizations for a full and sparse input matrix. When the matrix A is not sparse, the low rank 
SVD and the CUR factorizations require the same storage space (in terms of the number of nonzeros) 
for a given rank. The ID requires less storage because part of matrix V in the ID is the k x k identity 
matrix. The situation changes dramatically when A is a sparse matrix. On the right of Figure 5, we 
illustrate the situation for a 5% nonzero sparse matrix. Both the CUR and ID use less nonzeros than 
the low rank SVD, for a fixed rank k. In fact, for the ID, the storage size decreases as the identity 
matrix occupies a greater portion of V. Hence, especially in the case of sparse matrices, for the same 
amount of memory, we can use an ID or CUR approximation of higher rank k than for an SVD. 

We also illustrate the runtimes of the different functions. We perform our runs on a PC containing 
an Intel Xeon E5-2440 chip (6 cores, up to 2.90 GHz) and an NVIDIA Tesla K40c graphics card 
and we use a 6000 x 12000 dense matrix (we obtain similar results for 12000 x 6000 matrices). In 
Figure 6, we include the runtimes for the low rank factorizations obtained with the randomized and 
block randomized algorithms and the corresponding full factorizations. Especially with a GPU, great 
time savings are obtained, since matrix-matrix multiplication on GPU is cheap because it is very well 
parallelizable. The only current disadvantage of GPUs is a lack of high memory, which puts a limit 
on the matrix size possible to transfer onto the card. Note that on both architectures, the block 
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rank k rank k 

Figure 5: Total nonzeros in different factorizations for full and sparse matrices, as a function of the 
rank. 

randomized QB based methods are slower than their randomized counterparts. The advantage of 
QB however, is the ability to specify a tolerance error bound for the factorization to satisfy. One also 
avoids this way the factorization and multiplication of large matrices, which can be an advantage for 
a large A. 

In Figure 6, we also plot the runtimes of the full SVD, full QR, and an estimated lower bound for a 
partial pivoted QR factorization (this routine is implemented in RSVDPACK, but is currently slower 
than optimal). Notice that both the full SVD and QR are too expensive to compute if only a low 
rank factorization is desired. On the other hand the partial QR (obtained by doing only k iterations 
of Gram-Schmidt) is somewhat competitive on the CPU, although the resulting approximation error 
bound is of course higher than that of the low rank SVD. On the GPU, however, all randomized 
schemes show significantly lower runtimes. This is clearly explained by the difference in matrix matrix 
multiplication times plotted in row two of Figure 6. Since our algorithms are heavily based on matrix 
matrix multiplications, they perform well on the GPU. Also in Figure 6, we compare the runtimes of the 
Matlab mex interface routines we include for nx n matrices (of type II) using ranks k = 100, 300, 500 
to the PROPACK [II] package partial SVD library function lansvd, also partially accelerated by means 
of mex file subroutines. We observe a significant speedup with our mex interfaced routine. 
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100 


RUNTIMES for 6kx12k matrix on CPU 


100 


RUNTIMES for 6kx12k matrix on GPU 





RUNTIMES of QB schemes for k=800, nxn matrices 




matrix dimension n 


Figure 6: First row: runtimes of SVD, QR, partial QR, and randomized and block randomized SVD and 
ID factorizations for a 6000 x 12000 matrix on CPU and GPU. Second row: runtimes of matrix matrix 
multiplication on CPU and GPU (and GPU with memory transfer from RAM) for nxn matrices. 
Third row: runtimes of rsvdpack (rp) mex interface (with MKL) for low rank SVD of n x n matrices 
with different k vs propack (pp) lansvd mex accelerated algorithm. 
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6 Conclusions 


This article presents the mathematical details of RSVDPACK: an open source software package for 
efficiently computing low rank SVD, ID, and CUR factorizations of matrices. The package currently 
provides the end user with functions to perform each factorization using a non-randomized, random¬ 
ized, or block randomized algorithm using an input rank A: or a tolerance value. The package is divided 
into several C codes supporting both multi-core and GPU architectures and provides a Matlab mex file 
interface. The provided routines should be suitable for a wide range of applications and the package 
can be easily modified to match the necessary input/output formats. The software package we present 
is frequently being updated and enhanced with new functionality. Currently, we are further extending 
our set of accelerated routines for Matlab by means of mex files and planning for the development of 
routines targeting very large matrices using distributed memory computation. We also plan to add 
support for sparse matrices and further improve performance of the routines on GPUs. 


7 Availability of software 

The latest version of the open source software can be obtained from https: //github. com/sergeyvoronin 
and is available with a GNU GPL v3 license. The multi core implementation relies on the Intel MKL 
library available from https://software.intel.com/en-us/intel-mkl. The GPU accelerated im¬ 
plementations rely on MKL and the CULA dense library available from http: //www. culatools. com/, 
built atop NVIDIA’s CUBA framework or on MKL and NVIDIA cuBLAS (available with CUBA from 
https://developer.nvidia.com/cuda-zone). The MKL and CULA libraries are not open source, 
but are available under a variety of licensing terms. 
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A Summary of algorithms 


In this section, we present the pseudocode for the different algorithms discussed in the text. We first 
present the ID (Algorithm 1), the two-sided ID (Algorithm 2), and the CUR decomposition (Algorithm 
3) without the use of randomization, as they were discussed in section 2. Notice that all of these 
algorithms rely on the rank k pivoted QR factorization, which is usually not provided as a built in 
function in existing LAPACK packages. The low rank SVD of rank k is trivial to obtain without 
randomization, from the full SVD. 


Algorithm 1: A rank k ID decomposition 
Input : A G and parameter k < mm{m,n). 

Output: A column index set J and a matrix V G gach that A ps A(:, J(1 : k))\/*. 

1 Perform a rank k column pivoted QR factorization to get AP ~ QiSi; 

2 define the ordered index set J via !(:, J) = P; 

3 partition Si: Sn = Si(:, 1 : k), S 12 = Si(:, A; -|- 1 : n); 

4V = P[lfc Sr/Si2]*; 


Algorithm 2: A rank k two sided ID decomposition 
Input : A G and parameter k < min(m,n). 

Output: A column index set J, a row index set I and a matrices V G and 

W G such that A « WA(/(1 : k), J(1 : k))M*. 

1 Perform a one sided rank k ID of A so that A ~ CV* where C = A(:,J(1 : k))] 

2 Perform a full rank ID on C* so that C* = C* (:,/(! : k))\N*\ 


Algorithm 3: A rank k CUR-ID algorithm 
Input : A G and parameter k < mm{m,n). 

Output: Matrices C G R G and U G (such that A Ri CUR). 

1 Construct a rank k two sided ID of A so that Asa WA(/(1 : A:), J(1 : A:))V*; 

2 Construct matrices C = A(:, J(1 : k)) and R = A(I(1 : k ),:); 

3 Construct matrix U via U = V*Rt; 


Next, we present the randomized algorithms for computing the approximate low rank SVD, the ID, 
and QB decompositions. First we present the two low rank SVD methods from section 3.4. Algorithm 
4, which uses the eigendecomposition of the small BB* matrix in place of the SVD of B, and Algorithm 
5 which uses a QR factorization of B* to construct a small matrix R on which the SVD is performed. 
Notice that for very large matrices A, even the corresponding smaller matrix B (which will in general 
not be sparse even if A is) may still be too large to multiply. A derivative of Algorithm 4 can be used in 
this case. The matrix matrix product BB* can be evaluated a column at a time (via multiplication with 
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standard basis vectors, as in BB*ej) when B is too large. The same can be done for the computation 
of Vfc via the matrix-matrix product B*UZ^^. 

Notice that for both Algorithms 4 and 5, the largest (by magnitude) k singular value components 
are extracted at the end of the procedure. However, in some software packages (like Matlab) the 
eigenvalue ordering for the eigendecomposition is opposite to that of the singular value ordering for 
the SVD. For this reason, the last line of Algorithm 4 shows the last k components being extracted. 
Notice that steps 14 and 15 of both methods can be combined to yield more efficient computations 
with smaller matrices (e.g. Ufc = QU(:,(p+l):0). 

Next, we show the pseudocode for the randomized ID decomposition based on the discussion in 
section 3.5. Algorithm 6 can be further enhanced by using the power sampling scheme, as discussed 
in 3.5. 

Finally, in Algorithms 7 and 8, we present two randomized variants of the QB decomposition 
discussed in section 3. Of these. Algorithm 8 is only approximate, but has the advantage that it can 
be further parallelized than Algorithm 7 and does not require the update of the original (or copy of) 
matrix A. 


Algorithm 4: RSVD Algorithm Version I 


Input : A G integer rank parameter k < min(m, n), an integer 

oversampling parameter p > 0, an integer power sampling parameter 
q >0, and an integer re-orthogonalization amount parameter s > 1. 
Output: Matrices G G and G 

Set I = k + p and initialize a matrix R G R^^^ with Gaussian random entries; 
Form samples matrix Y = AR and utilize optional power scheme; 

for j = 1 to q do 


if mod ((2j — 2), s) == 0 then 
I [Y,.] = qr(Y,0); 

Z = A*Y; 

if mod ((2j — 1), s) == 0 then 
I [Z,-] = qr(Z,0); 

Y = AZ; 

10 Orthonormalize the columns of Y in [Q,-] =qr(Y,0); 

11 Obtain the smaller matrix B = Q*A derived from A; 

12 Form the even smaller I x I matrix T = BB*; 

13 Perform eigendecomposition oi I x I matrix [U, D] = eig(T) ; 

14 Form the approximate low rank SVD components of A using the results of the 

eigendecomposition Z^ = \/D, = QU, = B*UZ^^; 

15 Extract components corresponding to the k largest by magnitude singular 
values 

Ufc = Ufc(:, (p+1) : Oi = ^fc((p+l) : (p+l) • Oi ^k = Vfc(:, (p+1) : 1)', 
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Algorithm 5: RSVD Algorithm Version II 
Input : A G integer rank parameter k < min(m, n), an integer 

oversampling parameter p > 0, an integer power sampling parameter 
q >0, and an integer re-orthogonalization amount parameter s > 1. 
Output: Matrices Llfc G G and G 

1 Set I = k + p and initialize a matrix R G with Gaussian random entries; 

2 Form samples matrix Y = AR and utilize optional power scheme; 

3 for j = I to g do 

4 if mod ((2j — 2),s) == 0 then 

5 I [Y,.] = qr(Y,0); 

6 Z = A*Y; 

7 if mod ((2? — 1), s) == 0 then 

8 I [Z,-] = qr(Z,0); 

9 Y = AZ; 

10 Orthonormalize the columns of Y in [Q,-] =qr(Y,0); 

11 Compute the smaller matrix Bt = A*Q; 

12 Obtain the small I x I matrix R using a compact QR factorization 
[Q,R] = qr(Bt,o); 

13 Take the SVD of the I x I matrix R, [0, Z^, V] = svd(R); 

14 Form the approximate low rank SVD components of A using the results of the 
SVD of R. Ufc = QV, Vfc = QU; 


15 Extract components corresponding to the k largest by magnitude singular 
values Ufc = Ufc(:, 1 : A:); Z^. = Zfc(l :/c, 1 :/c); V^. = Vfc(:, 1 ;/c); 



Algorithm 6: A randomized rank k ID decomposition 
Input : A G a rank parameter k < min(m, n), and an oversampling parameter p. 

Output; A column index set J and a matrix V G (such that A ps A(:, J(1 : k))\/*). 


1 Construct a random matrix Q. G with i.i.d. Gaussian entries; 

2 Construct the sample matrix Y = f2A; 

3 Perform full pivoted QR factorization on Y to get: YP = QS; 

4 Remove p columns of Q and p rows of S to construct Qi and Si; 

5 Define the ordered index set J via l(:, J) = P; 

6 Partition Si: Sn = Si(;, 1 : A), S 12 = Si(;, A + 1 : n); 

7V = P[lfc Sr/Si2]*; 







Algorithm 7: A randomized blocked QB decomposition 
Input : A G integer block size parameter b, maximum number of blocks M, and double 

tolerance parameter e. 

Output: Matrices Q G B G s.t. ||QB — A|| < e (if M large enough). 

1 for i = 1 to M do 

2 fli = randn(n, 6); 

3 Qj = orth(Afij); 

4 for j = 1 to q do 

5 Qi = orth(A*Qj) ; 

6 Qi = orth(AQi) ; 

7 Q, = orth(Qi-E5-lQjQ,*Q*); 

8 B, = Q*A; 

9 A = A-QiBi; 

10 if ||A|| < e then break; 

11 Q = [Qi ••• Qi] and B = [B^ ••• B*]*; 



Algorithm 8: An approximate parallelizable randomized blocked QB decomposition 


Input : A G integer block size parameter b, maximum number of blocks M. 

Output: Matrices Q G 3 ^ qB _ 

1 Q = []; 

2 for i = 1 to M do 

3 = randn(n, 6); 

4 Yi = Af2*; 

5 for j = 1 to q do 

6 Qj = orth(Yi); 

7 Y, = A*Q,; 

8 Qi = orth(Yi); 

9 Yi = AQi; 

10 for i = 1 to M do 

11 I Qi = orth(Yj); 

12 for i = 1 to M do 

13 Qj = Qj - QQ*Qj; 

14 Qj = orth(Qi); 

15 Q = [Q,Q*]; 

16 B = Q*A; 







